Hive Partition

Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. Quite often, though, users need to filter the data on specific column values. To group similar data together on the basis of a column (the partition key), Hive organizes tables into partitions. Each table can have one or more partition keys, which identify a particular partition. Partitioning provides granularity in a Hive table: by scanning only the relevant partitioned data instead of the whole dataset, it reduces query latency.

Why is partitioning required?
  • In non-partitioned tables, Hive has to read all the files in a table’s data directory and then apply filters to them. This is slow and expensive, especially for large tables.
  • Partitions are essentially horizontal slices of data that allow a large dataset to be separated into more manageable chunks.
  • When we submit a SQL query against a non-partitioned table, Hive reads the entire dataset, which makes running MapReduce jobs over a large table inefficient. Creating partitions in tables resolves this, and Hive makes partitions very easy to implement: the partition scheme is applied automatically once it is declared when the table is created.
  • Partitioning can be done on one or more columns to impose a multi-dimensional structure on directory storage.
Say a table student_details contains the student information of an engineering college: student_id, name, department, year, etc. If we partition on the department column, the information for all students belonging to a particular department will be stored together in that very partition. Physically, a partition is nothing but a sub-directory in the table directory.

Suppose we have data for three departments in our student_details table – CSE, ECE, and Civil. We will then have three partitions in total, one for each department, and for each department all of its data will reside in a separate sub-directory under the Hive table directory. For example, all the student data for the CSE department will be stored in /user/hive/warehouse/student_details/dept=CSE. Queries about CSE students would then only have to look through the data present in the CSE partition. This makes partitioning very useful, as it reduces query latency by scanning only the relevant partitioned data instead of the whole dataset. In real-world implementations you will be dealing with hundreds of terabytes of data, so imagine scanning this huge amount of data for a query where 95% of the data scanned is irrelevant to it.
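Continuing the example above (assuming the partition column is named dept), the partition sub-directories can be listed directly from Hive; each line of output corresponds to one sub-directory under the table directory:

```sql
-- List the partitions of the student_details table
-- (assumes it was created with PARTITIONED BY (dept STRING))
SHOW PARTITIONS student_details;

-- Typical output, one line per partition sub-directory:
-- dept=CSE
-- dept=Civil
-- dept=ECE
```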

Syntax

CREATE [EXTERNAL] TABLE table_name
(
  col_name_1 data_type_1, ...
)
PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);
  • In Hive, partitioning is supported for both managed and external tables.
  • The partition column is a virtual column: its values are not stored in the data files themselves. Hive queries can take advantage of the partitioned data for better performance. Partitioned tables are created using the PARTITIONED BY clause at table-creation time.
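As a sketch of the syntax above, the student_details table from the earlier example could be declared like this (column names, types, and the row format are illustrative):

```sql
-- Partitioned managed table; 'dept' becomes a directory level,
-- e.g. /user/hive/warehouse/student_details/dept=CSE.
-- Note that dept appears only in PARTITIONED BY, not in the column list.
CREATE TABLE student_details (
  student_id INT,
  name       STRING,
  year       INT
)
PARTITIONED BY (dept STRING COMMENT 'student department')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```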
Types of Hive Partitioning
  • Static Partitioning
  • Dynamic Partitioning
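The two flavors differ in how the partition values are supplied at load time: static partitioning fixes them in the statement, while dynamic partitioning derives them from the data. A sketch using the hypothetical student_details table and a hypothetical unpartitioned staging table:

```sql
-- Static partitioning: the partition value is named explicitly,
-- and all loaded rows go into that one partition.
LOAD DATA LOCAL INPATH '/tmp/cse_students.csv'
INTO TABLE student_details
PARTITION (dept = 'CSE');

-- Dynamic partitioning: Hive determines the partition for each row
-- from the last column(s) of the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE student_details
PARTITION (dept)                  -- no value given; taken from the SELECT
SELECT student_id, name, year, dept
FROM staging_student_details;     -- hypothetical staging table
```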
Advantages
  • Partitioning is used for distributing execution load horizontally.
  • As the data is stored in slices/parts, query response time is faster, since only the small relevant part of the data is processed instead of searching the entire dataset.
  • For example, in a large user table partitioned by country, selecting users from country ‘IN’ will scan just the one directory ‘country=IN’ instead of all the directories.
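The country example above could look like this (table and column names are illustrative):

```sql
-- Only the sub-directory .../users/country=IN is scanned;
-- every other country partition is pruned away by the planner.
SELECT user_id, name
FROM users
WHERE country = 'IN';
```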
Limitations
  • A large number of partitions creates a large number of files and directories in HDFS, which creates overhead for the NameNode, since it maintains all of their metadata.
  • It may optimize certain queries based on the WHERE clause, but may cause slow responses for queries based on grouping clauses.
  • In MapReduce processing, a huge number of partitions leads to a huge number of tasks (each running in a separate JVM) per MapReduce job, which creates a lot of overhead in JVM start-up and tear-down. With small files, a separate task is used for each file. In the worst case, the overhead of JVM start-up and tear-down can exceed the actual processing time.
Best practices for partition tables and query on partitioned tables
  • Never partition on a unique ID.
  • Size partitions so that on average they are greater than or equal to 1 GB.
  • Formulate a query so that it does not process more than 1000 partitions.
Pointers
A few things to keep in mind when using partitioning:
  • It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of the data and put strain on the NameNode to manage all the underlying structures in HDFS.
  • Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan.
  • Partitioning columns should be selected so that they result in roughly similar-sized partitions, to prevent a single long-running task from holding everything up.
  • If hive.exec.dynamic.partition.mode is set to strict, then at least one partition must be static. In nonstrict mode, all partitions are allowed to be dynamic.
  • If your partitioned table is very large, you can block full-table-scan queries by putting Hive into strict mode with the set hive.mapred.mode=strict command. In this mode, when a user submits a query that would result in a full table scan (i.e., a query without a filter on any partition column), an error is issued.
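A small sketch of the strict-mode behavior described above, using the illustrative student_details table (newer Hive releases express the same check through hive.strict.checks.* properties):

```sql
-- Require a partition-column filter for queries on partitioned tables
SET hive.mapred.mode = strict;

-- This would now fail: no filter on the partition column 'dept'
-- SELECT COUNT(*) FROM student_details;

-- This is allowed: the predicate prunes the scan to a single partition
SELECT COUNT(*)
FROM student_details
WHERE dept = 'CSE';
```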
